simdjson talk

2022-03-26 · 4 min read

"Data Engineering at the Speed of Your Disk" - Daniel Lemire - 2020

Modern disks are pretty fast
- PCIe 4 disks: 5 GB/s sequential read
Modern datacenter VPC network real fucking fast
- 50 Gb/s (6.25 GiB/s) and better
If we can't eat data at many GB/s, we'll literally be CPU-bound reading from disk or network!
How fast can we allocate memory?

// almost instant (just virtual memory)
let mut buf = Vec::with_capacity(size);

// actually instantiate physical pages
for i in (0..buf.len()).step_by(page_size) {
	buf[i] = 0;
}

3.5 GB/s (Linux, Skylake 3.4 GHz, 4 KiB pages)

How fast can we, e.g., remove spaces from a string?
- Assume 1% space density (minified json)

for i in 0..buf.len() {
	if buf[i] > 32 /* space as decimal = 32 */ {
		buf[i + 1] = buf[i];
	}
}

1.6 GB/s

Just working byte-by-byte will already have limit of ~3.5 GB/s b/c our processor is 3.5 GHz.
We're already slower than disk if we work byte-by-byte.
so how are we going to go faster? SIMD : )

AVX-512 instructions can operate on an entire cache-line at once : 0

Working with SIMD
(1) use SIMD intrinsics. produces very disgusting code though.
(2) pray / massage code so compiler auto-vectorizes : )
despacer with SIMD

// filled with spaces
__m128i spaces = _mm_set1_epi8(' ');

for (i = 0; i + 15 < size; i += 16) {
	// load next chunk
	__m128i x = _mm_loadu_si128(bytes + i);

	// set a 1 byte anywhere we have an x byte <= 32
	__m128i anywhite = _mm_cmpeq_epi8(spaces, _mm_max_epu8(spaces, x));

	// condense anywhite mask into a packed u64 mask
	uint64_t mask16 = _mm_movemask_epi8(anywhite);

	// basically a pdep, but deposit the non-space bytes at the front
	x = _mm_shuffle_epi8(x, despace_mask16[mask16 & 0x7fff]);

	// store in-place
	_mm_storeu_si128(bytes + pos, x);
	// increment our in-place output pointer only by the # non-spaces.
	pos += 16 - mask16.count_ones();
}

base64 encode / decode lemire/fastbase64
- can encode w/ SIMD in only 3 instructions : 0 (mod loads/stores)
- decoding closer to 7 instructions
- once your data is not just in L1/L2 cache, this is just as fast as a memcpy
utf-8 encode / decode
byte-by-byte approach only 300 MB/s D :

if b1 < 0x80 {
	return true; // ascii
}
if b1 < 0xe0 {
	if b1 < 0xc2 || b2 > 0xbf {
		return false;
	}
} else if b1 < 0xf0 {
	// 3 byte form
	if b2 > 0xbf || (b1 == 0xe0 && b2 < 0xa0) || (b1 == 0xed && b2 >= 0xa0) {
		// ..
	} else {
		// ..
	}
} else {
	// 4 byte form
	// ..
}

using SIMD: 8 GB/s
using 32-byte chunks, ~20 instructions. branch-less
https://github.com/lemire/fastvalidate-utf-8
JSON for Modern C++ library (nlohmann-json)
0.1 GB/s (Skylake 3.4 GHz, GCC8, file: twitter.json)
rapidjson
0.3 GB/s (Skylake 3.4 GHz, GCC8, file: twitter.json)
both of these libs nowhere near disk/network speed
what is our "roofline" benchmark?

let mut all_line_lens = 0;
for line in reader.lines() {
	all_line_lens += line.len();
}

1.4 GB/s (Skylake 3.4 GHz, GCC8, file: twitter.json)
but you can actually go faster!
simdjson
2.5 GB/s (Skylake 3.4 GHz, GCC8, file: twitter.json)
some of the tricks used in simdjson
Find the span of the string
- Given a mask containing quotes, return a mask set to 1 in the string span regions.
- __1________1________1___1 quotes
- ___11111111__________111_ string regions

mask = quote xor (quote << 1);
mask = mask xor (mask << 2);
mask = mask xor (mask << 4);
mask = mask xor (mask << 8);
// ..

string span can actually be done in one instruction : 0
Number parsing is expensive
- strtod
- 90 MB/s
- 38 cycles / byte
- 10 branch misses / float
- yikes!
SIMD approach

fn contains_8_digits(buf: &[u8; 8]) -> bool {
	let val = u64::from_ne_bytes(buf);
	(   (val & 0xf0f0_f0f0_f0f0_f0f0) |
		(((val + 0x0606_0606_0606_0606) & 0xf0f0_f0f0_f0f0_f0f0) >> 4)
	) == 0x3333_3333_3333_3333
}

fn parse_8_digits_unrolled(buf: &[u8; 8]) -> u32 {
	let mut val = u64::from_ne_bytes(buf);
	val = ((val & 0x0f0f_0f0f_0f0f_0f0f) * 2561) >> 8;
	val = ((val & 0x00ff_00ff_00ff_00ff) * 6553601) >> 16;
	((val & 0x0000_ffff_0000_ffff) * 42949672960001) >> 32
}

https://github.com/lemire/fast-float-rust